
Add multimodal embedding & rerank support #66

Draft

roj234 wants to merge 2 commits into JamePeng:main from roj234:vl-embedding

Conversation

roj234 commented Feb 21, 2026

It works, but it duplicates logic: llama_chat_format already implements multimodal input, but it does not support embedding models like Qwen-VL-Embedding.
This code borrows heavily from llama-server's C++ implementation (ServerTokens).

JamePeng (Owner) commented Feb 21, 2026

It's best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, to manage mctx. There's no need to add unnecessary memory usage to Llama. Remember to release the memory after the new mctx is used.
If possible, please provide the necessary example and test code to illustrate its usage.
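The cleanup pattern being asked for here can be sketched standalone. This is a minimal illustration, not the actual bindings: `MtmdContext` and `mtmd_free` are hypothetical stand-ins, and the "handle" is a plain dict so the lifecycle is observable without loading a model. The `ExitStack` mirrors the `context_stack` idea and `__del__` is the safety net.

```python
import contextlib

def mtmd_free(handle):
    # Hypothetical stand-in for the real mtmd free call.
    handle["freed"] = True

class MtmdContext:
    """Owns a (fake) mctx handle and guarantees it is released exactly once."""

    def __init__(self):
        self.handle = {"freed": False}
        # Mirror the context_stack pattern: register cleanups on acquisition.
        self._stack = contextlib.ExitStack()
        self._stack.callback(mtmd_free, self.handle)

    def close(self):
        # Idempotent: ExitStack runs each callback at most once.
        self._stack.close()

    def __del__(self):
        # Safety net if the caller forgets close().
        self.close()

ctx = MtmdContext()
ctx.close()
# ctx.handle["freed"] is now True; a second close() is a no-op.
```

Registering the free callback at the moment the resource is created (rather than in a distant `finally`) is what makes the pattern robust against early returns and exceptions.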

roj234 (Author) commented Feb 21, 2026

Actually I am enhancing the existing Embedding class; however, I can move the mctx management to llama_embedding.py.
Regarding memory, I followed your context_stack and __del__ patterns to free it.
I also found that llama_chat_format contains the logic for multimodal processing, but it is tightly coupled with inference execution and does not expose a way to get the processed tokens.

By the way, here is my usage:

doc = [{"type": "text", "text": f"Name: {filepath.name}"},
       {"type": "image", "image": image_data}]
class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []

        image_id = 0
        # Should not manually concat the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            item_type = item['type']
            if item_type == 'text':
                tmpl += item['text']
            elif item_type == 'image':
                image_id += 1
                files.append(item['image'])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the media placeholder in mtmd

        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(self, contents: List[Dict[str, Any]], instruction: str = "Represent the user's input.", return_count: bool = False) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
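The prompt-building part of the snippet above is pure string work and can be exercised without a model. The sketch below extracts it as a free function (the `build_prompt` name is illustrative; the chat-template literals and the `<__media__>` placeholder are taken from the snippet):

```python
from typing import Any, Dict, List, Tuple

def build_prompt(contents: List[Dict[str, Any]], instruct: str) -> Tuple[str, list]:
    """Interleave text with numbered <__media__> placeholders, collecting media."""
    files = []
    image_id = 0
    tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
    for item in contents:
        if item["type"] == "text":
            tmpl += item["text"]
        elif item["type"] == "image":
            image_id += 1
            files.append(item["image"])
            tmpl += f"Picture {image_id}: <__media__>"
    return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

prompt, files = build_prompt(
    [{"type": "text", "text": "Name: cat.png"},
     {"type": "image", "image": b"\x89PNG..."}],
    "Represent the user's input.",
)
# One placeholder is emitted per collected media item.
assert prompt.count("<__media__>") == len(files) == 1
```

The invariant worth testing is the one mtmd relies on: the number of `<__media__>` markers in the prompt must equal the number of media buffers passed alongside it.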

JamePeng (Owner) commented
Currently there is indeed a lack of a multimodal class, analogous to Llama or the sampler classes, to abstract the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations inherited from llama 1.5 are indeed difficult to manage.

(cherry picked from commit 4ba212f)
roj234 (Author) commented Feb 24, 2026

With from llama_cpp.mtmd import Jinja2MultimodalChatFormatter, RAGModel can become:

    def __init__(self):
        eos_token_id = self._model.token_eos()
        bos_token_id = self._model.token_bos()

        eos_token = (
            self._model._model.token_get_text(eos_token_id) if eos_token_id != -1 else ""
        )
        bos_token = (
            self._model._model.token_get_text(bos_token_id) if bos_token_id != -1 else ""
        )

        self._formatter = Jinja2MultimodalChatFormatter(
            template=self._model.metadata['tokenizer.chat_template'],
            eos_token=eos_token,
            bos_token=bos_token,
            stop_token_ids=[eos_token_id]
        )

    def _tmpl(self, contents: List[Dict[str, any]], instruct: str):
        result = self._formatter([{
            "role": "system",
            "content": instruct
        }, {
            "role": "user",
            "content": contents
        }])

        return result.prompt, result.medias

Contents can be image or audio; local disk paths, network URLs, and bytes/bytearray instances are supported, but there is no video support yet. I also think create_completion is too complex, so I will create an alternate function instead (to avoid a breaking change).
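A hedged sketch of how the three supported media sources (bytes/bytearray, local path, URL) might be normalized to raw bytes. `load_media` is an illustrative helper, not the formatter's actual API; the URL branch uses stdlib `urllib` only to show the dispatch shape.

```python
import pathlib
import urllib.request

def load_media(source) -> bytes:
    """Normalize a media reference (bytes, local path, or URL) to raw bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)  # already in-memory data
    if isinstance(source, (str, pathlib.Path)):
        text = str(source)
        if text.startswith(("http://", "https://")):
            # Network URL: fetch the body.
            with urllib.request.urlopen(text) as resp:
                return resp.read()
        # Otherwise treat it as a path on local disk.
        return pathlib.Path(text).read_bytes()
    raise TypeError(f"unsupported media source: {type(source)!r}")
```

Dispatching on the reference type at the boundary keeps the formatter itself dealing with one canonical representation (bytes) regardless of where the media came from.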

JamePeng (Owner) commented

Hi @roj234, this PR can keep being adapted and optimized. I first need to refactor the batch decode and eval parts: the old execution logic has misalignment issues that cause the KV cache to get out of sync after newer models run their first round. Since the changes from ggml-org/llama.cpp@2b6dfe8 land on top of this, I will simply refactor along the lines of llama.cpp's current, newer approach. This will disrupt the Embedding part somewhat, but it should be worth it.

roj234 (Author) commented Feb 24, 2026

OK. My planned changes: besides the added LlamaEmbedding.embed_multimodal function, I will create a function similar to Llama.create_multimodal_chat_completion that can directly handle image/audio objects in a request, or video objects in the future. (I looked at Qwen VL's code; its video implementation uses ffmpeg to slice the video into an image sequence at n FPS, though new approaches may appear later, in which case it will depend on how the mtmd library implements them.)
Of course, I think the best approach (meaning deleting outdated code, at the cost of backward compatibility) would be to refactor create_chat_completion and remove the thousands of lines of templates and historical baggage in llama_chat_format, for example by turning them into named Jinja templates under a template directory, which would also fix the strange Llama->chat_format->Llama call chain.
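For reference, the Qwen-VL-style video handling mentioned above (slice a video into a frame sequence at n FPS with ffmpeg) amounts to a command like the one built below. This only constructs the argument list; it does not assume ffmpeg is installed, and the function name and output pattern are illustrative.

```python
def ffmpeg_frame_cmd(video_path: str, out_dir: str, fps: int = 2) -> list:
    """Build an ffmpeg command that extracts `fps` frames per second as PNGs."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # sample at the requested frame rate
        f"{out_dir}/frame_%04d.png",  # numbered output frames
    ]
```

The resulting PNG sequence could then be fed through the existing image path, which is why this decomposition avoids touching the embedding code at all.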
